Abstract

Similarity networks are important abstractions in many information management applications such as
recommender systems, corpora analysis, and medical informatics. For instance, in a recommender system,
by inducing similarity networks between movies rated similarly by users, we can aim to find the
global structure of connectivities underlying the data, and use the network to posit connections between
given entities. We present an algorithmic framework to efficiently find paths in an induced similarity
network without materializing the network in its entirety. Our framework introduces the notion of
'hammock' paths, which are generalizations of traditional paths in bipartite graphs. Given starting and
ending objects of interest, it explores candidate objects for path following and uses heuristics to
admissibly estimate the potential for paths to lead to a desired destination. We present three diverse
applications, modeled after the Netflix dataset, a broad subset of the PubMed corpus, and a database of
clinical trials. Experimental results demonstrate the potential of our approach for unstructured
knowledge discovery in similarity networks.

1. Introduction 

There is significant current interest in modeling and understanding network structures, especially in 
online social communities, web graphs, and biological networks. We focus here on (unipartite) similarity 
networks induced from bipartite graphs, such as a movies × people dataset. Two movies can be connected
if they have been rated similarly by a sufficient number of people; this is the basis for the popular
item-based recommendation algorithms H. A similarity network thus exposes an indirect level of clustering,
community formation, and organization that is not immediately apparent from the bipartite network. 

Similarity networks are key abstractions in recommender systems but they also find uses in other
information management applications such as collection analysis and medical informatics. We focus on how
these networks can be used for exploratory discovery, i.e., to see how similarities can be composed to reach 
potentially distant entities. For instance, how is the movie 'Roman Holiday' (a romantic classic) connected 
to 'Terminator 3' (a Sci-Fi movie)? What are the in-between movies and waypoints that help link these 
two disparate movies? Are these waypoints characteristic to the domain or to the dataset? For recommender 
systems, these questions are especially germane to tasks such as serendipitous recommendation, gift buying, 
and characterizing the movie watching interests of its users. 

In addition to recommender systems, we consider similarity networks induced from literature and
semi-structured sources of data. Here path finding has applications to literature-based discovery (similar to the
ARROWSMITH program [12]) and clinical diagnosis. For instance, how is congestive heart failure related
to kidney disease? The discovered waypoints correspond to possibilities by which different disease response 
pathways interact with each other. 

Admittedly, these questions can be answered by first inducing the similarity network and running
shortest path queries over it, but we desire to find paths without materializing the entire network.
Materializing it would be wasteful because only a subnetwork is likely to be traversed in response to a
query. Instead, we seek to
Figure 1: (Top) A simple path (hammock width=1) beginning at romantic classic "Roman Holiday" and ending at
the Sci-Fi & Fantasy movie "Terminator III". (Middle) A hammock path between the same entities but with hammock 
width=5. (Bottom) A hammock path between the same entities with an additional clique size requirement of 4. Observe 
that each hammock inside the cliques continues to have width 5. Notice that the path length increases from (Top) to 
(Middle) and then drops in (Bottom). 



pre-process the original bipartite graph into suitable data structures that can then be harnessed to find chains 
of connections. 

In this paper, we present an algorithmic framework for traversing paths using the notion of 'hammock'
paths, which are generalizations of traditional paths. Our framework is exploratory in nature so that, given
starting and ending objects of interest, it explores candidate objects for path following and uses heuristics to
admissibly estimate the potential for paths to lead to a desired destination. Our key contributions are:

1. We formulate the path finding problem over similarity networks in terms of hammocks and cliques, 
two intuitive-to-understand constructs for navigating similarity networks. Hammocks are ways to 
impose tighter requirements over individual links (leading to longer paths) and cliques are ways to 
coalesce multiple links (resulting in shorter paths). 

2. We present an algorithmic framework that finds hammock paths in both binary datasets (bipartite
graphs) and vector-valued datasets (weighted bipartite graphs). Our framework uses a concept lattice
representation as an in-memory data structure to organize the search for paths and introduces
admissible heuristics that quickly prune out unpromising paths.

3. We present compelling experimental results on three diverse datasets: the Netflix dataset of movie
ratings, a significant portion of the PubMed corpus, and a database of clinical trials. Experimental
results demonstrate the scalability and accuracy of our approach as well as its potential for unstructured
knowledge discovery.

2. Background 

The formulations we study can be intuitively understood using Fig. 1. A simple path between movies
can be induced through co-raters, as shown in Fig. 1 (top): 'Roman Holiday' was rated by a user who
also rated 'Rear Window', which was in turn rated by another user, and so on. In practice, paths are
induced between two movies only if a sufficient number of people have rated them and rated them similarly.
This leads to a sequence of 'hammocks' as shown in Fig. 1 (middle). A final level of generalization is to
organize the hammocks into groups so that we traverse cliques of movies (with a hammock between every pair of
them), see Fig. 1 (bottom). By finding such hammocks and traversing them systematically, similarities can



'diffuse' to reach possibly distant entities. Note that the path length increases as the hammock requirement 
is strengthened and then decreases as the clique requirement is imposed. Work closest to ours can be found 
in the graph modeling, redescription mining, and lattice-based information retrieval literature. 
Graph modeling: The use of the word 'hammock' for induced similarity networks appears to have been
first made in (H although this work does not aim to find paths. The notion of kNC-plots was introduced
in [6], where k denotes what we refer to as hammock width in this paper. Rather than finding local paths, that
work focuses on finding the global connectivity structure of the induced networks as the hammock width
is increased. It is also restricted to binary spaces whereas we focus on vector-valued spaces as well. Random
intersection graphs [13] are a class of theoretical models proposed in the random graph community. These
models randomly assign each vertex a subset of a given set and posit edges if these subsets intersect.
Under these modeling assumptions, connectivity and other properties of these graphs can be characterized 
0. 

Redescription mining: Redescriptions, a pattern class introduced in [11], induce subsets of data that share
strong overlap. A hammock in our notation can be viewed as a redescription between the objects it connects. 
Kumar et al. Q study the problem of 'storytelling' over a space of descriptors, which is the goal of finding a 
sequence (story) of redescriptions between two subsets by positing intermediate subsets that share sufficient 
overlap with their neighbors in the story. Although this is similar to our objective, the heuristic innovations 
in Q are restricted to binary spaces as well and cannot find paths of the same expressiveness (hammocks 
organized into cliques) and with the same efficiencies as done here. 

Lattice based information retrieval: The use of concept lattices as organizing data structures for fast 
retrieval, query, and data mining is not new flUEl. Our work is different in both the theoretical framework
by which the lattice concepts are used to build chains and in their uses/applications. 

3. Problem Formulation 

We begin by formally defining the space over which similarity networks are induced. 

Definition 1. A dataset $\mathcal{D} = (\mathcal{O}, \mathcal{F}, V)$ is a set of objects $\mathcal{O}$, a set of features $\mathcal{F}$, and a relation $V \subseteq \mathcal{O} \times \mathcal{F}$.

We can thus think of each $f \in \mathcal{F}$ as inducing a subset of $\mathcal{O}$, which we call $o(f)$. We further require
that the objects induced by $\mathcal{F}$ form a covering of $\mathcal{O}$, i.e. $\bigcup_{f \in \mathcal{F}} o(f) = \mathcal{O}$. In the applications studied here,
the $(\mathcal{O}, \mathcal{F})$ are (movies, users), (documents, terms), and (clinical trials, keywords), respectively, with the
obvious meaning of $V$ in each case. In Fig. 2 we introduce a dataset which we will use as a running
example. Here the objects are movies ($\mathcal{O} = \{m_1, m_2, m_3, m_4, m_5, m_6, m_7, m_8\}$) and the features are users
($\mathcal{F} = \{u_A, u_B, u_C, u_D, u_E, u_F, u_G\}$).

A note on notation: We use calligraphic $\mathcal{O}$ and $\mathcal{F}$ to represent the complete object and feature sets
for some database. Lower case letters $o$ and $f$ are used to represent individual members of $\mathcal{O}$ and $\mathcal{F}$,
respectively. To represent subsets of $\mathcal{O}$ and $\mathcal{F}$ we use upper case letters $O$ and $F$, resp.

For a feature set $F \subseteq \mathcal{F}$, we overload notation and define the operator $o(F) : \mathcal{F} \rightarrow \mathcal{O}$ as the set of
objects which are associated with all of the features in $F$, i.e.,

$$o(F) = \bigcap_{f \in F} o(f)$$

We also define a parallel operator $f(O)$ as the set of features associated with the object set $O$. In Fig. 2
we have $o(\{u_A, u_D, u_E\}) = \{m_1, m_3, m_5, m_7\}$ and $f(\{m_1, m_3, m_5, m_7\}) = \{u_A, u_D, u_E\}$.

We can now define the closure operator $c : 2^{\mathcal{F}} \rightarrow 2^{\mathcal{F}}$ as $c(F) = f(o(F))$. A feature set $F$ is closed
if and only if $c(F) = F$. (A parallel closure operator exists on $2^{\mathcal{O}}$, but is not necessary for our purposes.)
Using our running example from Figure 2 we see that feature set $\{u_A, u_D, u_E\}$ is closed, whereas $\{u_A, u_D\}$
is not, since $c(\{u_A, u_D\}) = \{u_A, u_D, u_E\} \neq \{u_A, u_D\}$.
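As an illustration of the operators $o(\cdot)$, $f(\cdot)$, and $c(\cdot)$, the following is a minimal Python sketch. The relation `RATED_BY` is a hypothetical toy dataset chosen so that the two examples quoted in the text hold; it is not the actual data of Figure 2, and the function names are our own.

```python
# Hypothetical toy relation (NOT the data of Figure 2): movie -> users who rated it.
RATED_BY = {
    "m1": {"uA", "uD", "uE"},
    "m2": {"uC", "uF"},
    "m3": {"uA", "uB", "uD", "uE"},
    "m4": {"uC", "uG"},
    "m5": {"uA", "uD", "uE", "uG"},
    "m6": {"uB", "uF", "uG"},
    "m7": {"uA", "uC", "uD", "uE"},
    "m8": {"uF", "uG"},
}
ALL_FEATURES = set().union(*RATED_BY.values())


def o(feature_set):
    """Objects associated with every feature in feature_set."""
    return {m for m, users in RATED_BY.items() if feature_set <= users}


def f(object_set):
    """Features shared by every object in object_set (all features if it is empty)."""
    if not object_set:
        return set(ALL_FEATURES)
    return set.intersection(*(RATED_BY[m] for m in object_set))


def c(feature_set):
    """Closure operator c(F) = f(o(F))."""
    return f(o(feature_set))


def is_closed(feature_set):
    """A feature set is closed if and only if c(F) == F."""
    return c(feature_set) == feature_set
```

On this toy relation, `{"uA", "uD", "uE"}` is closed while `{"uA", "uD"}` is not, because every movie rated by both `uA` and `uD` was also rated by `uE`.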
Figure 2: A dataset and its concept lattice.

Definition 2. For each feature $f \in \mathcal{F}$, we define a predicate $p_f(o)$ such that $p_f(o)$ is true if and only if
$o \in o(f)$. The set of all such predicates is denoted by $\mathcal{P}$. A descriptor is a boolean expression over $\mathcal{P}$.

A descriptor thus provides a mechanism to specify a set of objects which correspond to some Boolean
expression on features. Again referring to Figure 2, we might define a descriptor $u_A(o) \lor u_B(o)$, which
would correspond to all the movies which were rated by user A, by user B, or by both; it induces the set
$\{m_1, m_3, m_5, m_6, m_7\}$. It is easy to see that we can create a descriptor for the objects in $o(F)$ by creating
a conjunction over the features in $F$. For simplicity, we often speak of descriptor $F$ for some feature set,
which refers to this conjunction.

We employ the notion of redescriptions introduced by Ramakrishnan et al. [11] to capture similarities
between descriptors. If for descriptors $F$ and $F'$, where $F$ is tautologically distinct from $F'$ (i.e. $F \lor \lnot F'$ is
not a tautology), it is the case that $o(F) = o(F')$, we say that $F$ and $F'$ are redescriptions of each other, as
they induce the same set of objects. In other words, a redescription provides (at least) two different ways to
describe a given object set. We next introduce the concept of a redescription set:

Definition 3. A redescription set for a descriptor $F$, $R_F = (O', F')$, is a tuple where $O' = o(F)$ and
$F' = c(F)$.

Fig. 2 shows many redescription sets organized alongside a concept lattice (to be defined soon). One
redescription set is: $C_7 = (\{m_1, m_3, m_5, m_7\}, \{u_A, u_D, u_E\})$, which would correspond to both descriptors
$\{u_A, u_D, u_E\}$ and $\{u_A, u_D\}$.

We say that a descriptor $F$ is a relaxation of descriptor $F'$, denoted by $F \preceq F'$, if the feature set
$F = \{f_1, f_2, \ldots, f_m\} \subseteq \{f_1, f_2, \ldots, f_n\} = F'$. In Fig. 2, $\{u_A, u_B\} \preceq \{u_A, u_B, u_C\}$. The following is easy
to verify:

Lemma 1. $F \preceq F' \Rightarrow o(F) \supseteq o(F')$.

Finally, we bring in the notion of a concept lattice, the lattice defined over the redescription sets using
the operator $\preceq$ with a join of $\mathcal{F}$ and a meet of $\emptyset$. Returning once again to our running example, Figure 2
shows a concept lattice with the lower nodes being relaxations of the higher nodes in the lattice. In this
space we say that a descriptor $F$ is a child of $F'$ if and only if $F \preceq F'$ and $\forall F'' : F \preceq F'' \preceq F' \rightarrow (F'' = F) \lor (F'' = F')$.
Thus in our running example $C_{10}$ has two children: $C_4$ and $C_7$.
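To make the lattice construction concrete, the sketch below enumerates closed feature sets by brute force on a hypothetical toy relation (again, not the data of Figure 2) and derives the child relation defined above. It is only meant for tiny examples; the names `concepts`, `children`, and `min_support` are our own, and `min_support` is an assumed way of expressing a minimum number of objects per concept.

```python
from itertools import combinations

# Hypothetical toy relation (NOT the data of Figure 2): movie -> users who rated it.
RATED_BY = {
    "m1": {"uA", "uD", "uE"},
    "m2": {"uC", "uF"},
    "m3": {"uA", "uB", "uD", "uE"},
    "m4": {"uC", "uG"},
    "m5": {"uA", "uD", "uE", "uG"},
    "m6": {"uB", "uF", "uG"},
    "m7": {"uA", "uC", "uD", "uE"},
    "m8": {"uF", "uG"},
}
ALL_FEATURES = frozenset().union(*RATED_BY.values())


def objects_of(F):
    """o(F): objects carrying every feature in F."""
    return frozenset(m for m, users in RATED_BY.items() if F <= users)


def closure(F):
    """c(F) = f(o(F)); by convention all features when o(F) is empty."""
    objs = objects_of(F)
    if not objs:
        return ALL_FEATURES
    return frozenset.intersection(*(frozenset(RATED_BY[m]) for m in objs))


def concepts(min_support=1):
    """Map each closed feature set with |o(F)| >= min_support to its object set."""
    closed = set()
    for size in range(len(ALL_FEATURES) + 1):
        for combo in combinations(sorted(ALL_FEATURES), size):
            F = closure(frozenset(combo))
            if len(objects_of(F)) >= min_support:
                closed.add(F)
    return {F: objects_of(F) for F in closed}


def children(F, lattice):
    """Concepts strictly relaxing F with no other concept strictly in between."""
    below = [G for G in lattice if G < F]
    return [G for G in below if not any(G < H < F for H in below)]
```

Exhaustive enumeration is exponential in the number of features, so this only serves to check small cases by hand.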

Definition 4. The Jaccard coefficient between two objects is defined as

$$J(o_1, o_2) = \frac{|f(o_1) \cap f(o_2)|}{|f(o_1) \cup f(o_2)|}$$



The Jaccard coefficient is thus a measure of how similar two objects are based upon how many features 
they share in proportion to their overall size. 

Definition 5. The Soergel distance between two objects $o_1$ and $o_2$ is given by

$$D(o_1, o_2) = \frac{|f(o_1)| + |f(o_2)| - 2|f(o_1) \cap f(o_2)|}{|f(o_1)| + |f(o_2)| - |f(o_1) \cap f(o_2)|}$$

Unlike the Jaccard coefficient (which is a similarity measure), the Soergel distance is a true distance
measure: it is exactly 0.0 when the objects $o_1$ and $o_2$ have exactly the same features, is symmetric, and
obeys the triangle inequality. We also note that the Soergel distance is exactly 1.0 when the induced feature
sets are disjoint.
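The two measures can be sketched in a few lines of Python over feature sets. This is an illustrative sketch, not the authors' implementation; the handling of two empty feature sets is our own assumption. Since the denominator of the Soergel distance equals $|f(o_1) \cup f(o_2)|$, the two quantities are related by $D = 1 - J$ in the binary case.

```python
def jaccard(a, b):
    """Jaccard coefficient of two feature sets."""
    union = a | b
    # Assumption: two empty feature sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def soergel(a, b):
    """Soergel distance of two feature sets (Definition 5)."""
    shared = len(a & b)
    denom = len(a) + len(b) - shared
    # Assumption: two empty feature sets are at distance 0.
    return (len(a) + len(b) - 2 * shared) / denom if denom else 0.0


x = {"uA", "uD", "uE"}
y = {"uA", "uB", "uD", "uE"}
similarity = jaccard(x, y)   # 3 shared features out of 4 in the union
distance = soergel(x, y)     # (3 + 4 - 6) / (3 + 4 - 3)
```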

We now define two types of paths — clique paths and hammock paths — which we use to constrain our 
path generation. 

Definition 6. A hammock of width $w$ is a tuple $(o_1, o_2) \in \mathcal{O} \times \mathcal{O}$ where $|f(o_1) \cap f(o_2)| = w$.

A hammock is thus a pair of objects which share at least $w$ common features. Another way to think of a
hammock is as a redescription set containing two objects ($\{o_1\}$ and $\{o_2\}$), with a feature set which contains
at least $w$ features. Observe that a hammock is a basic unit of similarity modeling in recommender systems
and many other applications: it posits similarity in one domain using relationships with another domain. We
now define hammock paths composed of hammocks.

Definition 7. A hammock path with width $w$, from object $o_1$ to object $o_t$, denoted by $H^w(o_1, o_t)$, is a series
of objects $(o_1, o_2, \ldots, o_{t-1}, o_t)$ such that $\forall i : 1 \le i \le t-1$ the pair $(o_i, o_{i+1})$ is a hammock of width at least
$w$.

We now define a clique which extends the concept of a hammock to a set of objects rather than a single 
pair: 

Definition 8. A $k$-clique, $q^{k,w}$, is a set of $k$ objects $O = \{o_1, o_2, \ldots, o_k\}$ such that $\forall o_i, o_j \in O : |f(o_i) \cap f(o_j)| \ge w$,
for some width $w$. The function $T$ maps a clique to the set of $k$ objects contained in said clique.

Note that a $k$-clique is defined over a collection of $k(k-1)/2$ hammocks of width at least $w$; the overlap
threshold applied to every pair of objects of a clique ensures that there is at least as high an overlap between
any two objects in the clique.
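The hammock path and clique conditions can be checked directly from their definitions. The sketch below is illustrative only: `FEATS` is a hypothetical toy dataset (not Figure 2), and the function names are our own.

```python
from itertools import combinations

# Hypothetical toy data (NOT Figure 2): object -> feature set.
FEATS = {
    "m1": {"uA", "uD", "uE"},
    "m3": {"uA", "uB", "uD", "uE"},
    "m5": {"uA", "uD", "uE", "uG"},
    "m7": {"uA", "uC", "uD", "uE"},
    "m8": {"uF", "uG"},
}


def overlap(feats, a, b):
    """Number of features shared by objects a and b."""
    return len(feats[a] & feats[b])


def is_hammock_path(path, feats, w):
    """Every consecutive pair must form a hammock of width at least w."""
    return all(overlap(feats, a, b) >= w for a, b in zip(path, path[1:]))


def is_clique(objects, feats, w):
    """Every pair of objects must share at least w features."""
    return all(overlap(feats, a, b) >= w for a, b in combinations(objects, 2))
```

For instance, `["m1", "m5", "m8"]` is a hammock path of width 1 here but not of width 2, because `m5` and `m8` share only one feature. A $k$-clique requires checking all $k(k-1)/2$ pairs, which is what `is_clique` does.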

Definition 9. A $k$-clique path $Q^{k,w}(o_1, o_t)$ is a series of $t-1$ consecutive $k$-cliques $Q = (q_1, q_2, \ldots, q_{t-1})$
such that $o_1 \in T(q_1)$, $o_t \in T(q_{t-1})$, and $\forall q_i \in Q : T(q_i) \cap T(q_{i+1}) \neq \emptyset$.

The problem we seek to address can now be given as a pair of constraints upon the clique and hammock 
paths between two objects, formally defined as follows: 

Problem 1. Given $\mathcal{D} = (\mathcal{O}, \mathcal{F})$, start and end objects $o_1 \in \mathcal{O}$ and $o_t \in \mathcal{O}$, we seek to find a chain/series
of objects $S_{1,t} = (o_1, o_2, \ldots, o_t)$ where $S_{1,t}$ is a hammock path $H^w(o_1, o_t)$ for some width $w$, and further,
there is some $k$-clique path $Q^{k,w}(o_1, o_t) = (q_1, q_2, \ldots, q_{t-1})$ where $\forall i : 1 \le i \le t-1$, $o_i \in T(q_i)$.

4. Extension to Vector Spaces

To accommodate vector spaces, we generalize Definition 1 so that a database $\mathcal{D} = (\mathcal{O}, \mathcal{F}, V)$ is now a
tuple defined over a set of objects $\mathcal{O}$, features $\mathcal{F}$, and a function $V : \mathcal{O} \times \mathcal{F} \rightarrow \mathbb{R}$. This helps us go beyond
simple binary associations to record the strength of the association (e.g., rating values, term weights) or
other auxiliary continuous-valued data. (Our running example of Figure 2 contains movie ratings by users.)
The weighted Soergel distance between two objects $o_1$ and $o_2$ is defined by



$$D(o_1, o_2) = \frac{\sum_{f \in \mathcal{F}} |V(o_1, f) - V(o_2, f)|}{\sum_{f \in \mathcal{F}} \max(V(o_1, f), V(o_2, f))}$$

which reduces to the unweighted case if $V(o, f) = 1$ for any object $o \in o(f)$ and $V(o, f) = 0$ for any
object $o \notin o(f)$. Definitions 6, 7, 8, and 9 get similarly generalized. In place of the hammock width constraint
$w$, we use a weighted Soergel distance threshold $\theta$ that must be satisfied between the vectors that aim to
form hammocks, hammock paths, cliques, or clique paths. In place of the notation $H^w(o_1, o_t)$, we use
$H^\theta(o_1, o_t)$, and so on. The new problem becomes:

Problem 2. Given $\mathcal{D} = (\mathcal{O}, \mathcal{F}, V)$, start and end objects $o_1 \in \mathcal{O}$ and $o_t \in \mathcal{O}$, we seek to find a chain/series
of objects $S_{1,t} = (o_1, o_2, \ldots, o_t)$ where $S_{1,t}$ is a hammock path $H^\theta(o_1, o_t)$ for some distance $\theta$, and further,
there is some $k$-clique path $Q^{k,\theta}(o_1, o_t) = (q_1, q_2, \ldots, q_{t-1})$ where $\forall i : 1 \le i \le t-1$, $o_i \in T(q_i)$.
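The weighted Soergel distance used in this vector-space formulation can be sketched as follows. This is a minimal illustration that assumes non-negative weights stored as sparse dictionaries (feature to weight), with missing features treated as weight 0; the function name and the example ratings are hypothetical.

```python
def weighted_soergel(x, y):
    """Weighted Soergel distance between two sparse, non-negative vectors.

    x and y map feature -> weight; a missing feature counts as weight 0.
    """
    features = x.keys() | y.keys()
    numerator = sum(abs(x.get(f, 0.0) - y.get(f, 0.0)) for f in features)
    denominator = sum(max(x.get(f, 0.0), y.get(f, 0.0)) for f in features)
    # Assumption: two all-zero vectors are at distance 0.
    return numerator / denominator if denominator else 0.0


# Hypothetical ratings: |5-4| + |3-0| + |0-2| = 6 over 5 + 3 + 2 = 10.
rated = weighted_soergel({"uA": 5, "uB": 3}, {"uA": 4, "uC": 2})

# With 0/1 weights the value matches the set-based definition.
binary = weighted_soergel(
    {"uA": 1, "uD": 1, "uE": 1},
    {"uA": 1, "uB": 1, "uD": 1, "uE": 1},
)
```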

5. Algorithms 

Our overall methodology is based on using the concept lattice to structure the search for paths. Recall 
that the two parameters influencing the quality of the path — hammock width and clique size — impose a 
duality. The hammock width is posed over feature sets but the clique size is over objects. We use the clique 
size to prune the concept lattice during construction (by incorporating it as a support constraint) and the 
hammock width to select candidates for dynamic construction of paths. There are three key computational 
stages: (i) construction of the concept lattice, (ii) generating promising candidates for path following, and 
(iii) evaluating candidates for potential to lead to destination. Of these, the first stage can be viewed as a 
startup cost that can be amortized over multiple path finding tasks. 

We adopt the CHARM-L fPfll algorithm of Zaki for constructing concept lattices and mining redescriptions.
The second and third stages are organized as part of an A* search algorithm that begins with the
starting object, uses the concept lattice to identify candidates satisfying the hammock and clique size
requirements, and evaluates them heuristically for their promise in leading to the end object. In practice, we
will place a limit on the branching factor ($b$) of the search, thus sacrificing completeness for efficiency. We
showcase these steps in detail below, including the construction of admissible heuristics.

5.1. Successor Generation 

Successor generation is the task of, given an object, using the hammock and clique size requirements, 
to identify a set of possible successors for path following. Note that this does not use the end object in its 
computation. We study three techniques for successor generation: 

1. Cover Tree Nearest Neighbor,

2. Nearest Neighbors Approximation (NNA), and

3. $k$-Clique Near Neighbor (kCNN).

The first technique is targeted toward finding paths when the clique size requirement is 2 (top and middle
paths of Fig. 1). That is, this technique is able to generate $b$ singleton successors, where $b$ is the branching
factor of the search. The second two techniques concentrate on paths of any clique size, such as the bottom
path of Fig. 1. Instead of generating singleton successors, the NNA and kCNN algorithms are able to
generate successor-sets, where each of these sets constitutes a candidate $k$-clique with the given object.



5.1.1 Cover Tree Nearest Neighbor 



The cover tree Hi is a data structure for fast nearest neighbor operations in a space of objects organized
alongside any distance metric (here, we use the Soergel distance Q ). The space complexity is $O(|\mathcal{O}|)$,
i.e., linear in the object size of the database. A nearest neighbor query requires logarithmic time in the
object space, $O(c^{12} \log n)$, where $c$ is the expansion constant associated with the featureset dimension of
the dataset (see HI for details).

5.1.2 Nearest Neighbors Approximation (NNA) 

The second mechanism we use for successor generation is to approximate the nearest neighbors of an object 
using the concept lattice. We use the Jaccard coefficient between two objects as an indicator to inversely 
(and approximately) track the Soergel distance between the objects. In order to efficiently calculate an 
object's nearest neighbors, however, we cannot simply calculate the Jaccard coefficient between it and every 
other object. We harness the concept lattice to avoid wasteful comparisons. We first define a predicate r on 
redescriptions and objects. 

Algorithm 1 NNA(o)
Input: An object o ∈ O

  fringe ← T(o) order by feature set size
  prospects ← ∅
  while fringe ≠ ∅ do
    r ← dequeue from fringe
    while prospects ≠ ∅ do
      o' ← head prospects
      ifW)>S^then
        yield o'
        dequeue prospects
      else
        break
      end if
    end while
    for all r' ∈ ChildrenOf r do
      add r' to fringe order by feature set size
      for all o' in o(r') do
        add o' to prospects order by J(o, o')
      end for
    end for
  end while



Definition 10. For redescription $R_F = (O, F)$ and object $o$, $r(o, R_F)$ if and only if $o \in o(F)$ and there is
no $F' \neq F$ such that $F \preceq F'$ and $o \in o(F')$.

Informally, we can say $r(o, R_F)$ if $F$ is on the edge of redescriptions which contain the object $o$. In
Figure 2, using object $m_1$, the only feature set $F$ for which $r(m_1, R_F)$ would evaluate to true is $F = \{u_A, u_D, u_E\}$.
Note that if no support threshold (clique constraint) is given, there will always be exactly
one such feature set for each object. When using a support threshold greater than one, this is no longer true;
so we now define the set $T(o, 2^{\mathcal{F}})$ for object $o$.



Definition 11. For object $o$ and all redescriptions $2^{\mathcal{F}}$ in a concept lattice containing $o$, $T(o, 2^{\mathcal{F}}) = \{F : F \in 2^{\mathcal{F}} \land r(o, R_F)\}$.

We generally omit the $2^{\mathcal{F}}$ where it is clear from context. The set $T(o)$ is then the set of all redescriptions
which form the upper edge in the concept lattice where $o$ appears. All the objects in Figure 2 have singleton
sets for $T$; however, if we change the support threshold from one object to five objects, then the object $m_1$
will have $T(m_1) = \{\{u_A\}, \{u_D, u_E\}\}$.

A formal description of our NNA algorithm is shown in Algorithm 1.

Definition 12. $\mathrm{NNA}_k(o)$ is the $k$-th object returned by the NNA algorithm when run on input object $o$.

Theorem 1. Given a database with object set $\mathcal{O}$ of size $|\mathcal{O}| = n$ containing object $o \in \mathcal{O}$, $\forall i, j \in \mathbb{Z} :
1 \le i < j \le n \rightarrow J(o, \mathrm{NNA}_i(o)) \ge J(o, \mathrm{NNA}_j(o))$.

(Proof omitted due to space constraints.) In other words, NNA returns better approximate redescriptions
of an object first. This is still, however, an approximation since it uses the Jaccard coefficient rather than
the Soergel distance.

5.1.3 $k$-Clique Near Neighbor (kCNN)

The basic idea of the kCNN approach is, in addition to finding a good set of successor nodes for a given
object $o$, to have a sufficient number of them so that, combinatorially, they contribute a desired
number of cliques. With a clique size constraint of $k$, it is not sufficient to merely pick the top $k$ neighbors
of the given object, since the successor generation function expects multiple clique candidates. (Note that,
even if we picked the top $k$ neighbors, we would still need to subject them to a check to verify that every
pair satisfies the hammock width constraint.) Given that this function expects $b$ clique candidates, the minimum
number $m$ of candidate objects to identify can be cast as the solution to the inequalities:

$$\binom{m-1}{k} < b \le \binom{m}{k}$$

The object list of each concept of the lattice is ordered in the number of features (e.g., see Fig. 2) and
this aids in picking the top $m$ candidate objects for the given object $o$. We pick up these $m$ candidate objects
for $o$ from the object list of the concept containing the longest feature set and redescription set containing
$o$. Note that, in practice, the object list of each concept is much larger than $m$ and as a result kCNN does
not need to traverse the lattice to obtain promising candidates. kCNN thus forms combinations of size $k$
from these $m$ objects to obtain a total of $b$ $k$-cliques. Since $m$ is calculated using the two inequalities, the
total number of such combinations is equal to or slightly greater than $b$ (but never less than $b$). Each clique
is given an average distance score calculated from the distances of the objects of the clique and the current
object $o$. This aids kCNN in returning a priority queue of exactly $b$ candidate $k$-cliques.
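The candidate-count computation and the clique scoring described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the nearest-first list `neighbours` is assumed to be given (in the paper it comes from a concept's object list), `dist` is any distance function, and the pairwise hammock check on each candidate clique is omitted.

```python
from itertools import combinations
from math import comb


def min_candidates(b, k):
    """Smallest m with C(m-1, k) < b <= C(m, k)."""
    m = k
    while comb(m, k) < b:
        m += 1
    return m


def kcnn(current, neighbours, dist, b, k):
    """Return up to b candidate cliques as (average distance, members) pairs.

    neighbours: candidate objects, assumed already sorted nearest-first.
    """
    m = min_candidates(b, k)
    pool = neighbours[:m]
    scored = []
    for members in combinations(pool, k):
        average = sum(dist(current, x) for x in members) / k
        scored.append((average, members))
    scored.sort(key=lambda pair: pair[0])  # best (smallest) average first
    return scored[:b]


# Hypothetical usage with integers as objects and absolute difference as distance.
candidates = kcnn(0, list(range(1, 11)), lambda a, c: abs(a - c), b=5, k=2)
```

With $b = 20$ and $k = 2$, `min_candidates` gives $m = 7$, since $\binom{6}{2} = 15 < 20 \le \binom{7}{2} = 21$.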

5.2. Evaluating Candidates 

We now have a basket of candidates that are close to the current object and must determine which of 
these has potential to lead to the destination. We present two operational modes to rank candidates: (i) the 
normal mode and (ii) the mixed mode. 

5.2.1 Normal Mode 

Figure 3: Distribution of common and uncommon features of objects $o_1$ and $o_2$ inside and outside the concept lattice.
(The hexagon indicates the concept lattice).

The normal mode is suitable for the general case where we have all the objects and features resident in the
database. The primary criterion of optimality for the A* search procedure is the cumulative Soergel distance
of the path. We use the straight line Soergel distance for the heuristic. It is easy to see that this can never
overestimate the cost of a path from any object $o$ to the goal (and is hence admissible), because the Soergel
distance maintains the triangle inequality.
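The normal-mode search can be sketched as a standard A* loop whose path cost is the cumulative Soergel distance and whose heuristic is the direct Soergel distance to the goal. The sketch below is illustrative only: successor generation is a brute-force stand-in (not the cover tree, NNA, or kCNN), a hop is assumed to be allowed when its distance is at most a threshold `theta`, and `FEATS` is hypothetical toy data.

```python
import heapq
from math import inf

# Hypothetical toy data: object -> feature set (binary case).
FEATS = {
    "a": {1, 2, 3},
    "b": {2, 3, 4},
    "c": {3, 4, 5},
    "d": {4, 5, 6},
}


def soergel(x, y):
    shared = len(x & y)
    denom = len(x) + len(y) - shared
    return (len(x) + len(y) - 2 * shared) / denom if denom else 0.0


def find_path(start, goal, feats, theta, b):
    """A* over objects; returns (path, cost) or None if no path is found."""
    def dist(p, q):
        return soergel(feats[p], feats[q])

    def successors(p):
        # Brute-force stand-in: nearest objects within theta, at most b of them.
        near = sorted((dist(p, q), q) for q in feats if q != p)
        return [(d, q) for d, q in near if d <= theta][:b]

    # Heap entries: (g + h, g, path); h is the direct distance to the goal.
    frontier = [(dist(start, goal), 0.0, [start])]
    best = {start: 0.0}
    while frontier:
        _, g, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:
            return path, g
        if g > best[node]:
            continue  # stale entry superseded by a cheaper route
        for step, nxt in successors(node):
            g2 = g + step
            if g2 < best.get(nxt, inf):
                best[nxt] = g2
                heapq.heappush(frontier, (g2 + dist(nxt, goal), g2, path + [nxt]))
    return None
```

Because the heuristic never exceeds the remaining path cost, the first time the goal is popped its path is optimal among those reachable through the generated successors. Truncating the successor list to `b` entries is what can make the search miss existing paths.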



5.2.2 Mixed Mode 

The mixed mode distance measure is effective for large datasets where only important information is stored
and other information is removed from the system after recording some aggregated information about it,
to save space and the cost of establishment. With the mixed mode approach, for simplicity, we assume that all
the information about items outside the concept lattice is absent but some aggregated information, such as
the number of features truncated, is provided. Fig. 3 shows the distribution of common and uncommon
features of objects $o_1$ and $o_2$ inside and outside a concept lattice. The mixed mode Soergel distance is given
by:

$$D_{\mathrm{mixed}}(o_1, o_2) = \frac{|U_{1L}| + |U_{2L}|}{|U_{1L}| + |U_{2L}| + |C_L| + |T_O|}$$

where the terms are defined in Fig. 3.

Theorem 2. $D_{\mathrm{mixed}}(o_1, o_2)$ never overestimates the original Soergel distance $D(o_1, o_2)$.
Proof: Omitted due to more general Theorem 3 later.
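The unweighted mixed mode distance can be compared numerically against the full Soergel distance. The sketch below is an illustration under our own assumptions: `lattice_features` stands for the features retained in the concept lattice, and $|T_O|$ is taken to be the number of features of either object that fall outside it. All names are hypothetical.

```python
def soergel(f1, f2):
    """Full Soergel distance on complete feature sets."""
    shared = len(f1 & f2)
    denom = len(f1) + len(f2) - shared
    return (len(f1) + len(f2) - 2 * shared) / denom if denom else 0.0


def mixed_soergel(f1, f2, lattice_features):
    """Mixed mode Soergel distance using only in-lattice details plus |T_O|."""
    u1l = len((f1 - f2) & lattice_features)  # uncommon features of o1 in the lattice
    u2l = len((f2 - f1) & lattice_features)  # uncommon features of o2 in the lattice
    cl = len(f1 & f2 & lattice_features)     # common features in the lattice
    t_o = len((f1 | f2) - lattice_features)  # assumed: truncated features of either object
    denom = u1l + u2l + cl + t_o
    return (u1l + u2l) / denom if denom else 0.0


# Hypothetical example: features "x" and "y" were truncated from the lattice.
o1 = {"a", "b", "c", "x"}
o2 = {"b", "c", "d", "y"}
kept = {"a", "b", "c", "d"}
estimate = mixed_soergel(o1, o2, kept)  # 2 / (1 + 1 + 2 + 2)
exact = soergel(o1, o2)                 # (4 + 4 - 4) / (4 + 4 - 2)
```

In this example the estimate is one third while the exact distance is two thirds, in line with the claim that the mixed mode value does not overestimate.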



5.2.3 Mixed Mode in Vector Spaces 

However, the mixed mode becomes more complex in the vector space model than in the set model. Figure 4
shows our formula for the mixed mode in the vector space. Similar to the unweighted one, we assume that
all the information about features of objects outside the concept lattice is absent in the weighted mixed
mode approach, but some aggregated information, such as minimum (minw) and maximum (maxw)
weights, is provided.

$$D_{\mathrm{mixed}}(o_1, o_2) = \frac{\cdots}{|U_{1L}| \times \mathrm{minw}(o_1) + |U_{2L}| \times \mathrm{minw}(o_2) + \sum_{f \in C_L} \max(V(o_1, f), V(o_2, f)) + |T_O| \times \max(\mathrm{maxw}(o_1), \mathrm{maxw}(o_2))}$$

Figure 4: The mixed mode Soergel distance equation.

Consider the set of features $T_O$ that do not appear in the lattice due to the support threshold of the
concept lattice, minsup. Some features of $T_O$ can be common to both objects $o_1$ and $o_2$. $|U_{1O}|$ and $|U_{2O}|$
are the numbers of uncommon features in objects $o_1$ and $o_2$ which are outside the frequent concept
lattice. The size $|T_O|$ is known from the recorded aggregated information, but $|U_{1O}|$ and $|U_{2O}|$
are unknown. This is why $|U_{1O}|$ and $|U_{2O}|$ do not appear in $D_{\mathrm{mixed}}(o_1, o_2)$. For $D_{\mathrm{mixed}}(o_1, o_2)$, we consider
that all the features of $T_O$ (i.e., features outside the lattice) are common to both objects $o_1$ and $o_2$ and that all
these features have the same weight, which is $\max(\mathrm{maxw}(o_1), \mathrm{maxw}(o_2))$.

Theorem 3. $D_{\mathrm{mixed}}(o_1, o_2)$ never overestimates the original Soergel distance $D(o_1, o_2)$.

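Proof: Let the numerators of $D(o_1, o_2)$ and $D_{\mathrm{mixed}}(o_1, o_2)$ be $\eta^{1,2}$ and $\eta^{1,2}_{\mathrm{mixed}}$ respectively. Similarly,
let $\varpi^{1,2}$ and $\varpi^{1,2}_{\mathrm{mixed}}$ be the denominators of $D(o_1, o_2)$ and $D_{\mathrm{mixed}}(o_1, o_2)$. It is clear that $\eta^{1,2}_{\mathrm{mixed}} \le \eta^{1,2}$.
Therefore, it suffices to show that $\varpi^{1,2}_{\mathrm{mixed}} \ge \varpi^{1,2}$.

For ease of derivation, we define the following notation:

$$\begin{aligned}
\alpha &= |U_{1L}| \times \mathrm{minw}(o_1) + |U_{2L}| \times \mathrm{minw}(o_2) \\
\beta &= \sum_{f \in C_L} \max(V(o_1, f), V(o_2, f)) \\
\zeta &= \sum_{f \in (T_O - U_{1O} - U_{2O})} \max(V(o_1, f), V(o_2, f)) \\
\lambda &= |U_1| \times \mathrm{minw}(o_1) + |U_2| \times \mathrm{minw}(o_2) \\
\xi &= |U_{1O}| \times \mathrm{minw}(o_1) + |U_{2O}| \times \mathrm{minw}(o_2) \\
\rho &= |T_O| \times \max(\mathrm{maxw}(o_1), \mathrm{maxw}(o_2))
\end{aligned}$$

Therefore we have $\varpi^{1,2} = \lambda + \beta + \zeta$. Now, $\varpi^{1,2}_{\mathrm{mixed}} = \alpha + \beta + \rho = \lambda + \beta + \zeta - \xi + \rho - \zeta =
\varpi^{1,2} + \rho - \zeta - \xi = \varpi^{1,2} + L$, where $L = \rho - \zeta - \xi$. Since $L \ge 0$ is always true, $\varpi^{1,2}_{\mathrm{mixed}} \ge \varpi^{1,2}$. Therefore,

$D_{\mathrm{mixed}}(o_1, o_2) \le D(o_1, o_2)$. □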

6. Experimental Results 

We present our experimental results on four datasets (see Table 1) using a 64-bit Windows Vista machine with an Intel
Core2 Quad CPU Q9450 @ 2.66GHz and 8 GB of physical memory.

6.1. Evaluating successor generation strategies 

The goal of our first experiment was to assess the number of nodes explored by the A* search and
the time taken as a function of the discovered path length, for different successor generation


Table 1: Dataset characteristics.

Dataset       # obj.    # feat.    # relations    Sparsity (%)
Synthetic      1,000      3,679        430,477    99.88
Netflix       17,770    480,189    100,480,507    99.98
PubMed       161,693    133,252      2,226,616    99.99
Cl. Trials    66,526     75,656        907,759    99.99

Figure 5: Synthetic dataset: Comparison between successor generation strategies.

strategies. For this purpose, we defined a synthetic dataset involving 1000 randomly selected movies from
the Netflix dataset and aimed to generate more than 50,000 similarity paths between randomly selected pairs,
with a $\theta$ threshold of 0.95 and a branching factor bound ($b$) of 20. We introduced a time threshold of 120
seconds in the A* search, and Fig. 5 depicts the results of only successful searches. Around 80% of these
searches failed when using the cover tree and NNA approaches, but the kCNN approach was able to
either successfully generate paths or to declare that no path exists in less than 120 seconds. As Fig. 5 shows,
the cover tree and NNA terminate early due to the applied time constraint but kCNN generated long paths
of length 14 and 13 with $k$=2 and 7, respectively. The runtime trends shown in Fig. 5 (right) also mirror the
number of nodes explored in Fig. 5 (left).

This result is not surprising, as the cover tree algorithm does not factor in the clique constraint, thus
preventing it from taking advantage of the search space reduction that this constraint provides. NNA does
take advantage of this constraint; however, it generates a strict ordering on the Jaccard coefficient over
the cliques, whereas kCNN simply generates some $b$ candidate cliques. In practice, the kCNN relaxation
results in the discovery of candidate cliques more rapidly than the NNA algorithm, while still remaining
accurate, as in both cases a post-processing step is necessary to determine if a given candidate does in fact
meet the search threshold. Through the rest of this paper, we thus use the kCNN algorithm as it generally
provides the best performance of the three algorithms.

6.2. Netflix dataset 

Viewing movies as objects and users as features, we construct the concept lattice for the Netflix dataset
with a support constraint of 20%. The resulting lattice contains 5,884 concepts, 120 of which were leaves
and the rest had child concepts. We picked 50,000 pairs of movies and attempted to generate hammock
paths between them.

Figure 6: Netflix: Explorations by varying clique size requirement at a fixed Soergel threshold (θ=0.9). The first three 
plots show average run time, number of nodes explored, and effective branching factor, as a function of path length, 
for different clique size requirements. The fourth plot shows that the length of the longest path decreases as the clique size 
increases. 

Figure 7: Netflix: Hammock paths from the 1943 classic drama "Titanic" to the action movie "Die Hard 2: Die Harder" 
for four different clique size requirements. 

Figure 6 shows our experimental results with varying clique size and a fixed distance threshold. From the 
graph at the left, it is evident that the performance of our kCNN algorithm depends on the clique size: 
hammock paths with lower clique sizes are generated faster than those with higher 
clique sizes. The run time closely follows the number of nodes that must be explored to generate 
the paths (the second graph). The third graph shows the corresponding trends of the effective branching 
factor: the higher the clique size, the worse (that is, larger) the effective branching factor. 
The effective branching factor measures the size of the traversed search space relative to the 
corresponding breadth first search (BFS), and is therefore a measure of the efficacy of the heuristic in pruning 
out unwanted paths. Although the effective branching factor worsens with larger clique sizes, larger clique 
sizes yield shorter hammock paths; thus, the lengths of the paths are also affected by the clique size requirement. 
The plot at the right depicts the lengths of the hammock paths as a function of clique size. It demonstrates 
that our algorithm can generate longer chains with smaller cliques. For example, Figure 7 
gives hammock paths between the 1943 classic drama "Titanic" and the action movie "Die Hard 2: Die 
Harder", for four different clique sizes (k=9, 12, 15 and 18). 
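
As a concrete reading of the effective branching factor discussed above, the sketch below assumes the 
standard textbook definition: if a search explores N nodes and finds a path at depth d, the effective 
branching factor b* is the branching factor of a uniform tree of depth d containing N nodes below the root, 
that is, the solution of N = b* + (b*)^2 + ... + (b*)^d. The exact definition used for the plots may differ, 
and the function name is hypothetical. 

```python
def effective_branching_factor(nodes_explored, depth, tol=1e-9):
    """Solve nodes_explored = b + b**2 + ... + b**depth for b by bisection.

    Assumes the standard textbook definition of the effective branching
    factor; requires depth >= 1 and nodes_explored >= depth.
    """
    if depth < 1 or nodes_explored < depth:
        raise ValueError("need depth >= 1 and nodes_explored >= depth")

    def tree_size(b):
        # Number of nodes below the root in a uniform tree of this depth.
        return sum(b ** i for i in range(1, depth + 1))

    # tree_size(1) = depth <= N and tree_size(N) >= N, so the root is bracketed.
    lo, hi = 1.0, float(nodes_explored)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tree_size(mid) < nodes_explored:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A value close to 1 means the heuristic keeps the search nearly on a single path; larger values mean the 
search fans out more at each level, as observed above for larger clique sizes. 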



6.3. PubMed Abstracts 



In our PubMed case study, we view paper abstracts as objects and terms as features. We curated 161,693 
publications related to several cytokines, transcription factors and feedback molecules and modeled only 
their titles and abstracts. There were around 133,252 unique stemmed terms in the dataset after the removal 
of stop words, numerals, DNA sequences, and special characters. Again, for randomly selected pairs of 
papers, we discovered over a million hammock paths. 

Figure 8(a) shows the trends of the number of nodes explored, time to generate paths, and effective branching 
factor, as a function of path length. Although Figure 8(a) shows that the average number of nodes 
explored is larger with a higher threshold, this behavior will not necessarily hold for 
other datasets. The result can in fact be the opposite in datasets where the number of paths generated is not 
lessened by reducing θ, especially when each object has a large number of features. 

(a) Varying threshold θ, fixed clique size (k=6). (Left) Number of nodes explored by the A* procedure, as a function of path length: lower 
thresholds result in fewer explored nodes. (Middle) The higher the distance threshold, the higher the 
run time to generate paths. (Right) Lower distance thresholds tend to generate paths with a lower effective branching factor. 

(b) The use of the Soergel distance heuristic results in better performance in terms of number of nodes explored, runtime, and effective branching 
factor. The dashed line in each of the plots shows the percentage improvement due to the use of the heuristic over plain BFS with h=0. 

(c) Mixed mode operation: (Left) The mixed mode heuristic requires less traversal of the similarity network than the corresponding BFS. 
(Middle) The mixed mode heuristic saves a larger percentage of time with larger cliques over the corresponding BFS. (Right) The effective 
branching factor improvement is higher with larger cliques. 

Figure 8: Some illustrative experimental results using the PubMed dataset. 

Figure 8(b) (left) shows that the use of the Soergel distance heuristic saves significant object exploration 
over the vanilla BFS. The larger the clique size, the higher the percentage by which the heuristic reduces 
the number of node explorations. Using the straight line Soergel distance as the heuristic 
saves more than 300% of node explorations by the A* procedure over the BFS (h=0) for a clique size k=14. 
The saving is more than 100% even with the smallest clique size k=2. The average time to generate a 
hammock path also falls due to the saved node exploration. Figure 8(b) (middle) shows that the heuristic 
saves more than 800% runtime with clique size k=14. Even with a clique size k=2, the savings are near 
200%. In the best case with k=14, the heuristic also improves the effective branching factor by 90% over 
the BFS, as shown in the plot at the right. In the worst case with k=2, it offers around 4% improvement of the 
effective branching factor. 
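
For readers who want to reproduce the quantities discussed above, the following sketch computes the Soergel 
distance between two non-negative feature vectors using its standard definition (sum of absolute differences 
divided by sum of element-wise maxima). It also shows one plausible way to express a percentage saving: 
because the reported savings exceed 100%, the sketch assumes they are measured relative to the cost with the 
heuristic rather than relative to the BFS baseline. That reading is an assumption, and both function names 
are hypothetical. 

```python
def soergel_distance(x, y):
    """Soergel distance between two equal-length, non-negative vectors.

    Defined as sum(|x_i - y_i|) / sum(max(x_i, y_i)); two all-zero vectors
    are treated as identical (distance 0.0).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


def percent_saving(baseline, with_heuristic):
    """Saving relative to the heuristic's own cost (an assumed convention).

    Under this convention, a BFS that explores 400 nodes against 100 nodes
    with the heuristic corresponds to a 300% saving.
    """
    return 100.0 * (baseline - with_heuristic) / with_heuristic
```

For binary vectors the Soergel distance reduces to one minus the Jaccard coefficient, so identical objects 
are at distance 0 and objects with no shared features are at distance 1. 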

Despite the use of a truncated dataset, Figure 8(c) shows that the mixed mode gains due to the heuristic 
over the BFS have a similar trend to the normal mode of Figure 8(b). Therefore, the mixed mode offers a 
practical mechanism to provide the best possible gains from lossy datasets without time consuming remod- 
eling of the vector space (e.g., |0 uses costly remodelling as a post processing step). 



Table 2: Clinical trials: A hammock path connecting congestive heart failure and kidney complications. 

Trial ID    Short Title of the Trial 
--------    ------------------------ 
00103519    Study of DITPA in Patients With Congestive Heart Failure. 
00696631    European Trial of Dronedarone in Moderate to Severe Congestive Heart Failure. 
00697086    Study of Dronedarone in Atrial Fibrillation. 
00744874    Ablation of the Pulmonary Veins for Paroxysmal Afib. 
00807586    Corticosteroid Pulse After Ablation. 
00030563    Surgery With or Without Radiofrequency Ablation Followed by Irinotecan in Treating 
            Patients With Colorectal Cancer that is Metastatic to the Liver. 
00003753    Floxuridine, Dexamethasone, and Irinotecan After Surgery in Treating Patients With Liver 
            Metastases From Colorectal Cancer. 
00005818    SU5416 and Irinotecan in Treating Patients With Advanced Colorectal Cancer. 
00002828    Chemotherapy With Raltitrexed and Fluorouracil in Treating Patients With Advanced 
            Colorectal Cancer. 
00449137    Arsenic Trioxide, Fluorouracil, and Leucovorin in Treating Patients With Stage IV 
            Colorectal Cancer That Has Relapsed or Not Responded to Treatment. 
00124605    Arsenic Trioxide and Pamidronate in Treating Patients With Advanced Solid Tumors or 
            Multiple Myeloma. 
00302627    Pamidronate, Vitamin D, and Calcium for the Bone Disease of Kidney and Heart 
            Transplantation. 
00074516    Kidney Transplantation in Patients With Cystinosis. 



6.4. Clinical Trials 

We curated more than 60 thousand clinical trials from clinicaltrials.gov and concentrated on the purpose 
and description of each trial. Here trials are objects and terms are features. The concept lattice we used 
had around a thousand unique concepts and was generated with minsup=10%. Due to space restrictions, we 
show qualitative rather than quantitative results here. 

Table 2 describes a significant chain of trials connecting two disparate clinical studies related to congestive 
heart failure and kidney transplantation in patients with cystinosis. (One accepted practice to assess the 
statistical significance of discovered chains is, for each step in the path, to assess the likelihood of overlap for 
the given descriptor sizes using the hypergeometric distribution and attribute a p-value after FDR corrections 
such as the Benjamini-Hochberg procedure.) The chain starts with heart failure trials and goes through studies 
on atrial fibrillation (abnormal heart rhythm), cardiac ablation, colorectal cancer, and advanced solid tumors, 
and eventually reaches a clinical study on kidney transplantation in patients with cystinosis. Such connections 
between cardiovascular disease and kidney failure are an intense topic of current research (e.g., see 01). 
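
The significance assessment mentioned in parentheses above can be sketched as follows. For each step of a 
chain, the p-value is the hypergeometric probability that two descriptor sets of the given sizes, drawn from a 
vocabulary of a given size, overlap at least as much as observed; the per-step p-values are then adjusted with 
the Benjamini-Hochberg procedure. This is a minimal sketch rather than the procedure actually run on the 
clinical trials data; the function names and the choice of the full vocabulary as the population are 
assumptions. It uses only the Python standard library (`math.comb`, Python 3.8+). 

```python
from math import comb


def overlap_p_value(vocab_size, size_a, size_b, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(vocab_size, size_a, size_b).

    X is the number of shared terms when size_b terms are drawn without
    replacement from a vocabulary in which size_a terms are 'marked'.
    """
    upper = min(size_a, size_b)
    total = comb(vocab_size, size_b)
    tail = sum(comb(size_a, k) * comb(vocab_size - size_a, size_b - k)
               for k in range(overlap, upper + 1))
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A chain would then be reported as significant when every adjusted p-value along the path falls below the 
chosen FDR level. 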

7. Discussion 

We have presented an efficient algorithmic approach to discover hammock paths in similarity networks 
without inducing these networks in their entirety. The experimental results have demonstrated scalability, 
effectiveness of our heuristics, and ability to yield domain-specific insight. We posit that our approach can 
be a useful information exploration tool for understanding the structure of connectivities underlying boolean 
and vector-valued association datasets. Future work is geared toward incorporating additional distance 
measures and defining new compressed representations of datasets that can serve multiple uses, from concept 
modeling to distance estimation. 